Clustering and classification

Clustering and classification are visual ways of exploring statistical data. In clustering, the aim is to group a set of objects so that objects in the same group (cluster) are more similar, in some chosen sense, to each other than to objects in other clusters. In classification, the groups are known in advance and the aim is to assign observations to them.

Overview of the data

This week I am using a built-in dataset, namely the Boston dataset, which contains housing information for the Boston, Massachusetts area. The data were collected by the U.S. Census Service and are also available online.

The Boston dataset consists of 506 observations of 14 variables. Let us take a closer look at those variables.

## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
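
A listing like the one above could be produced roughly as follows. This is a minimal sketch, assuming the dataset is loaded from the MASS package (the text only calls it built-in).

```r
# Assumption: Boston is taken from the MASS package, which is bundled with R.
library(MASS)
data("Boston")

# Structure of the data: variable names, types and the first few values.
str(Boston)

# Number of observations (rows) and variables (columns).
dim(Boston)
```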

The variable names are not exactly self-explanatory. Because our analysis will again depend upon them, I will briefly explain what they are.

| Variable | Description |
|----------|-------------|
| crim | per capita crime rate by town |
| zn | proportion of residential land zoned for lots over 25,000 sq.ft. |
| indus | proportion of non-retail business acres per town |
| chas | Charles River dummy variable (1 if tract bounds river; 0 otherwise) |
| nox | nitric oxides concentration (parts per 10 million) |
| rm | average number of rooms per dwelling |
| age | proportion of owner-occupied units built prior to 1940 |
| dis | weighted distances to five Boston employment centres |
| rad | index of accessibility to radial highways |
| tax | full-value property-tax rate per $10,000 |
| ptratio | pupil-teacher ratio by town |
| black | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town |
| lstat | % lower status of the population |
| medv | median value of owner-occupied homes in $1000s |

Thus, the Boston dataset covers a variety of factors: demographic, economic and environmental, as well as safety.
Let us next take a graphical tour of the data. In the figure below, each variable is plotted against the other variables.

Here we already see patterns emerging, such as:

* accumulation close to the edges and corners of the box, e.g. age/lstat
* diagonal shapes, e.g. nox/dis
* round shapes close to one of the corners, e.g. rad/tax
* points confined to a few discrete positions in all pairs, e.g. chas (binary) and rad (an index with only a handful of values)
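
A scatter-plot matrix of this kind can be drawn with base R. The sketch below assumes the data comes from the MASS package; the plotting symbols and the subset of variables are my own choices for readability, not necessarily those used for the figure.

```r
library(MASS)
data("Boston")

# Scatter-plot matrix: each variable plotted against every other variable.
pairs(Boston, pch = ".", gap = 0.2)

# A smaller selection is easier to read, e.g. the pairs mentioned in the list.
selected <- c("age", "lstat", "nox", "dis", "rad", "tax", "chas")
pairs(Boston[, selected], pch = 20, cex = 0.4)
```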

Summaries of the variables are given below, together with the summary of the scaled data.

Correlation matrix
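
The correlation matrix itself can be computed with `cor()`. This is only a sketch, assuming Pearson correlations on the unscaled data from the MASS package; the helper names `abs_cor` and `strongest` are made up for illustration.

```r
library(MASS)
data("Boston")

# Pairwise Pearson correlations, rounded for readability.
cor_matrix <- round(cor(Boston), 2)
print(cor_matrix)

# For each variable, find the variable it is most strongly correlated with
# (in absolute value), ignoring the trivial correlation with itself.
abs_cor <- abs(cor_matrix)
diag(abs_cor) <- 0
strongest <- apply(abs_cor, 1, function(r) names(which.max(r)))
print(strongest)
```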

DESCRIBE AND INTERPRET THE OUTPUTS (COMMENT ON DISTRIBUTIONS OF THE VARIABLES & RELATIONSHIPS BETWEEN THEM)

Standardization

Standardize the dataset and print a summary of the original data followed by a summary of the scaled data:

##       crim            zn            indus            chas      
##  Min.   : 0.0   Min.   :  0.0   Min.   : 0.46   Min.   :0.000  
##  1st Qu.: 0.1   1st Qu.:  0.0   1st Qu.: 5.19   1st Qu.:0.000  
##  Median : 0.3   Median :  0.0   Median : 9.69   Median :0.000  
##  Mean   : 3.6   Mean   : 11.4   Mean   :11.14   Mean   :0.069  
##  3rd Qu.: 3.7   3rd Qu.: 12.5   3rd Qu.:18.10   3rd Qu.:0.000  
##  Max.   :89.0   Max.   :100.0   Max.   :27.74   Max.   :1.000  
##       nox              rm            age             dis       
##  Min.   :0.385   Min.   :3.56   Min.   :  2.9   Min.   : 1.13  
##  1st Qu.:0.449   1st Qu.:5.89   1st Qu.: 45.0   1st Qu.: 2.10  
##  Median :0.538   Median :6.21   Median : 77.5   Median : 3.21  
##  Mean   :0.555   Mean   :6.28   Mean   : 68.6   Mean   : 3.80  
##  3rd Qu.:0.624   3rd Qu.:6.62   3rd Qu.: 94.1   3rd Qu.: 5.19  
##  Max.   :0.871   Max.   :8.78   Max.   :100.0   Max.   :12.13  
##       rad             tax         ptratio         black         lstat     
##  Min.   : 1.00   Min.   :187   Min.   :12.6   Min.   :  0   Min.   : 1.7  
##  1st Qu.: 4.00   1st Qu.:279   1st Qu.:17.4   1st Qu.:375   1st Qu.: 7.0  
##  Median : 5.00   Median :330   Median :19.1   Median :391   Median :11.4  
##  Mean   : 9.55   Mean   :408   Mean   :18.5   Mean   :357   Mean   :12.7  
##  3rd Qu.:24.00   3rd Qu.:666   3rd Qu.:20.2   3rd Qu.:396   3rd Qu.:17.0  
##  Max.   :24.00   Max.   :711   Max.   :22.0   Max.   :397   Max.   :38.0  
##       medv     
##  Min.   : 5.0  
##  1st Qu.:17.0  
##  Median :21.2  
##  Mean   :22.5  
##  3rd Qu.:25.0  
##  Max.   :50.0
##       crim             zn            indus             chas      
##  Min.   :-0.42   Min.   :-0.49   Min.   :-1.556   Min.   :-0.27  
##  1st Qu.:-0.41   1st Qu.:-0.49   1st Qu.:-0.867   1st Qu.:-0.27  
##  Median :-0.39   Median :-0.49   Median :-0.211   Median :-0.27  
##  Mean   : 0.00   Mean   : 0.00   Mean   : 0.000   Mean   : 0.00  
##  3rd Qu.: 0.01   3rd Qu.: 0.05   3rd Qu.: 1.015   3rd Qu.:-0.27  
##  Max.   : 9.92   Max.   : 3.80   Max.   : 2.420   Max.   : 3.66  
##       nox               rm             age              dis       
##  Min.   :-1.464   Min.   :-3.88   Min.   :-2.333   Min.   :-1.27  
##  1st Qu.:-0.912   1st Qu.:-0.57   1st Qu.:-0.837   1st Qu.:-0.80  
##  Median :-0.144   Median :-0.11   Median : 0.317   Median :-0.28  
##  Mean   : 0.000   Mean   : 0.00   Mean   : 0.000   Mean   : 0.00  
##  3rd Qu.: 0.598   3rd Qu.: 0.48   3rd Qu.: 0.906   3rd Qu.: 0.66  
##  Max.   : 2.730   Max.   : 3.55   Max.   : 1.116   Max.   : 3.96  
##       rad              tax            ptratio           black      
##  Min.   :-0.982   Min.   :-1.313   Min.   :-2.705   Min.   :-3.90  
##  1st Qu.:-0.637   1st Qu.:-0.767   1st Qu.:-0.488   1st Qu.: 0.20  
##  Median :-0.522   Median :-0.464   Median : 0.275   Median : 0.38  
##  Mean   : 0.000   Mean   : 0.000   Mean   : 0.000   Mean   : 0.00  
##  3rd Qu.: 1.660   3rd Qu.: 1.529   3rd Qu.: 0.806   3rd Qu.: 0.43  
##  Max.   : 1.660   Max.   : 1.796   Max.   : 1.637   Max.   : 0.44  
##      lstat            medv       
##  Min.   :-1.53   Min.   :-1.906  
##  1st Qu.:-0.80   1st Qu.:-0.599  
##  Median :-0.18   Median :-0.145  
##  Mean   : 0.00   Mean   : 0.000  
##  3rd Qu.: 0.60   3rd Qu.: 0.268  
##  Max.   : 3.55   Max.   : 2.987
## [1] "matrix"
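
The standardization step might look like this. It is a sketch assuming base R's `scale()` with its default settings, which subtracts the column mean and divides by the column standard deviation; the object name `boston_scaled` is my own.

```r
library(MASS)
data("Boston")

# Summary of the original data.
summary(Boston)

# Centre each column on its mean and divide by its standard deviation.
boston_scaled <- scale(Boston)
summary(boston_scaled)

# scale() returns a matrix; convert it to a data frame for later modelling.
is.matrix(boston_scaled)
boston_scaled <- as.data.frame(boston_scaled)

# Sanity check: column means are (numerically) zero, standard deviations one.
round(colMeans(boston_scaled), 10)
apply(boston_scaled, 2, sd)
```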

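How did the variables change? After standardization every variable has mean 0, as the second summary shows, and standard deviation 1 by construction, so all variables are on a comparable scale. The scaled object is a matrix rather than a data frame.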

Create a categorical variable of the crime rate from the scaled crime rate, using its quantiles as break points:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -0.42   -0.41   -0.39    0.00    0.01    9.92
##       0%      25%      50%      75%     100% 
## -0.41937 -0.41056 -0.39028  0.00739  9.92411
## crime
##      low  med_low med_high     high 
##      127      126      126      127
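
The binning could be done with `quantile()` and `cut()`. This is a sketch under the assumption that the quartiles shown above are used directly as break points; the names `bins` and `crime` are illustrative.

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

# Quartiles of the scaled crime rate serve as break points.
summary(boston_scaled$crim)
bins <- quantile(boston_scaled$crim)
bins

# include.lowest = TRUE keeps the minimum value inside the first bin.
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
table(crime)

# Replace the numeric crime rate with the categorical one.
boston_scaled$crim <- NULL
boston_scaled$crime <- crime
```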

Division into train and test sets

Divide the dataset into train and test sets so that 80% of the data belongs to the train set:
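
A random 80/20 split can be sketched as below. The seed is an arbitrary value I picked so that the example is repeatable; the text does not say whether a seed was set.

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

# Arbitrary seed, only so that the split is repeatable.
set.seed(2017)

# Draw 80% of the row indices at random (without replacement).
n <- nrow(boston_scaled)
ind <- sample(n, size = floor(0.8 * n))

# Rows in ind form the train set, the remaining rows the test set.
train <- boston_scaled[ind, ]
test <- boston_scaled[-ind, ]

dim(train)
dim(test)
```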

Linear discriminant analysis (LDA)

Fit a linear discriminant analysis to the train set, with the categorical crime rate as the target variable and all other variables as predictors, and draw the LDA (bi)plot:

## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##      low  med_low med_high     high 
##    0.262    0.248    0.240    0.250 
## 
## Group means:
##              zn  indus    chas    nox      rm    age    dis    rad    tax
## low       0.974 -0.905 -0.0866 -0.887  0.4588 -0.887  0.884 -0.690 -0.741
## med_low  -0.107 -0.307  0.0426 -0.567 -0.0952 -0.342  0.309 -0.548 -0.475
## med_high -0.388  0.187  0.2553  0.393  0.0778  0.447 -0.365 -0.453 -0.336
## high     -0.487  1.017 -0.1164  1.036 -0.4532  0.831 -0.868  1.638  1.514
##          ptratio   black   lstat   medv
## low       -0.466  0.3830 -0.7648  0.536
## med_low   -0.121  0.3187 -0.1936  0.047
## med_high  -0.267  0.0804  0.0429  0.148
## high       0.781 -0.6543  0.8986 -0.691
## 
## Coefficients of linear discriminants:
##             LD1     LD2     LD3
## zn       0.0630  0.7025 -1.0304
## indus    0.0329 -0.1325  0.2351
## chas    -0.1013 -0.0895  0.0750
## nox      0.3605 -0.8075 -1.4696
## rm      -0.1436 -0.0672 -0.1826
## age      0.1682 -0.3541 -0.1768
## dis     -0.0245 -0.2903 -0.0200
## rad      3.6157  1.0492 -0.0716
## tax      0.1351 -0.0891  0.6960
## ptratio  0.1054 -0.0371 -0.4492
## black   -0.1022  0.0710  0.1395
## lstat    0.3075 -0.3059  0.1675
## medv     0.2764 -0.4317 -0.2875
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9566 0.0333 0.0101
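
The fit above could be reproduced along these lines. This is a sketch that rebuilds the data from scratch so that it runs on its own; the seed is arbitrary, so the numbers will differ from the output shown, and the arrow scaling factor of 2 is just a cosmetic choice of mine.

```r
library(MASS)
data("Boston")

# Rebuild the scaled data with the categorical crime variable.
boston_scaled <- as.data.frame(scale(Boston))
bins <- quantile(boston_scaled$crim)
boston_scaled$crime <- cut(boston_scaled$crim, breaks = bins,
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL

# 80% train set (arbitrary seed).
set.seed(2017)
ind <- sample(nrow(boston_scaled), size = floor(0.8 * nrow(boston_scaled)))
train <- boston_scaled[ind, ]

# crime is the target; every other column is a predictor.
lda.fit <- lda(crime ~ ., data = train)
lda.fit

# Observations on the first two discriminants, coloured by crime class.
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)

# Biplot arrows: direction and relative weight of each predictor.
heads <- coef(lda.fit)
arrows(0, 0, 2 * heads[, 1], 2 * heads[, 2], length = 0.1, col = "red")
text(2 * heads[, 1:2], labels = rownames(heads), pos = 3, cex = 0.7)
```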

Save the crime categories from the test set:

Prediction

Predict the classes with the LDA model on the test data and cross-tabulate the predictions against the crime categories from the test set (correct_classes):

##           predicted
## correct    low med_low med_high high
##   low       11      10        0    0
##   med_low    5      13        8    0
##   med_high   1       6       19    3
##   high       0       0        0   26
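
A cross-tabulation like the one above might be obtained as follows. Again this is a self-contained sketch with an arbitrary seed, so the counts will not match the table exactly; `confusion` and `accuracy` are names I introduced.

```r
library(MASS)
data("Boston")

# Scaled data with the categorical crime variable.
boston_scaled <- as.data.frame(scale(Boston))
bins <- quantile(boston_scaled$crim)
boston_scaled$crime <- cut(boston_scaled$crim, breaks = bins,
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL

# Train/test split (arbitrary seed) and model fit.
set.seed(2017)
ind <- sample(nrow(boston_scaled), size = floor(0.8 * nrow(boston_scaled)))
train <- boston_scaled[ind, ]
test <- boston_scaled[-ind, ]
lda.fit <- lda(crime ~ ., data = train)

# Save the true classes and remove them from the test data.
correct_classes <- test$crime
test$crime <- NULL

# Predict and cross-tabulate true against predicted classes.
lda.pred <- predict(lda.fit, newdata = test)
confusion <- table(correct = correct_classes, predicted = lda.pred$class)
confusion

# Share of test observations on the diagonal, i.e. classified correctly.
accuracy <- sum(diag(confusion)) / sum(confusion)
accuracy
```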

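Comment on results: the model classifies 69 of the 102 test observations correctly (about 68%). All 26 high-crime observations are predicted correctly, whereas nearly all errors are confusions between neighbouring categories, e.g. 10 low observations predicted as med_low.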

K-means clustering

Reload the Boston dataset and standardize it.

Calculate the distances between observations:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.13    3.46    4.82    4.91    6.19   14.40
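
The distance summary can be sketched with `dist()`, which uses the Euclidean distance by default. The Manhattan variant is included only for comparison and is my own addition.

```r
library(MASS)
data("Boston")

# Reload and standardize the data.
boston_scaled <- scale(Boston)

# Pairwise Euclidean distances between all observations (the default method).
dist_eu <- dist(boston_scaled)
summary(dist_eu)

# For comparison: Manhattan (city-block) distances.
dist_man <- dist(boston_scaled, method = "manhattan")
summary(dist_man)
```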

Run k-means clustering (Euclidean distance):
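
A first k-means run might look like this. The number of clusters (3) and the seed are placeholders I chose for the sketch, since the text does not state them, and the columns in the plot are an arbitrary selection.

```r
library(MASS)
data("Boston")
boston_scaled <- scale(Boston)

# k-means starts from random centres, so fix a seed for repeatability.
set.seed(123)

# Assumption: 3 clusters as a first guess.
km <- kmeans(boston_scaled, centers = 3)

# Cluster sizes.
table(km$cluster)

# A few variables plotted pairwise, coloured by cluster membership.
pairs(boston_scaled[, 6:10], col = km$cluster)
```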

Investigate the optimal number of clusters, run k-means again and interpret the results:
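
One common way to look for a suitable number of clusters is to plot the total within-cluster sum of squares (WCSS) against the number of clusters and look for the point where it drops sharply. The sketch below assumes that approach; the upper limit of 10 clusters and the final choice of 2 are placeholders, not results from the text.

```r
library(MASS)
data("Boston")
boston_scaled <- scale(Boston)

set.seed(123)

# Total within-cluster sum of squares for k = 1, ..., k_max.
k_max <- 10
twcss <- sapply(1:k_max, function(k) {
  kmeans(boston_scaled, centers = k, nstart = 10, iter.max = 50)$tot.withinss
})

# The "elbow" of this curve suggests a reasonable number of clusters.
plot(1:k_max, twcss, type = "b",
     xlab = "number of clusters", ylab = "total within-cluster sum of squares")

# Run k-means again with the chosen number (2 is a placeholder here).
km <- kmeans(boston_scaled, centers = 2, nstart = 10)
pairs(boston_scaled[, 6:10], col = km$cluster)
```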

3D plot

We were also given the code for producing a 3D plot. I did not have time to study it in detail, but I absolutely had to see what it looks like, so here it is!

## [1] 404  13
## [1] 13  3
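
The two dimensions above are those of the predictor matrix (404 x 13) and of the LDA coefficient matrix (13 x 3); multiplying them gives each train observation's coordinates on the three discriminants. The sketch below shows that step. The plotting part assumes the plotly package, which the text does not name, so it is skipped when the package is not installed.

```r
library(MASS)
data("Boston")

# Scaled data with the categorical crime variable, as before.
boston_scaled <- as.data.frame(scale(Boston))
bins <- quantile(boston_scaled$crim)
boston_scaled$crime <- cut(boston_scaled$crim, breaks = bins,
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL

set.seed(2017)  # arbitrary seed
ind <- sample(nrow(boston_scaled), size = floor(0.8 * nrow(boston_scaled)))
train <- boston_scaled[ind, ]
lda.fit <- lda(crime ~ ., data = train)

# Predictors only (everything except the target).
model_predictors <- train[, names(train) != "crime"]
dim(model_predictors)
dim(lda.fit$scaling)

# Project the observations onto the three linear discriminants.
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

# Assumption: plotly is used for the 3D scatter plot, coloured by crime class.
if (requireNamespace("plotly", quietly = TRUE)) {
  p <- plotly::plot_ly(x = matrix_product$LD1, y = matrix_product$LD2,
                       z = matrix_product$LD3, type = "scatter3d",
                       mode = "markers", color = train$crime)
  print(p)
}
```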

Check the colours (they should be the crime classes of the train set).